ABSTRACT

ExoLocator (http://exolocator.eopsf.org) collects in 
a single place information needed for comparative 
analysis of protein-coding exons from vertebrate 
species. The main source of data (the genomic sequences
and the existing exon and homology annotation)
is the ENSEMBL database of completed
vertebrate genomes. To these, ExoLocator adds 
the search for ostensibly missing exons in 
orthologous protein pairs across species, using an 
extensive computational pipeline to narrow down 
the search region for the candidate exons and find 
a suitable template in the other species, as well as 
state-of-the-art implementations of pairwise align- 
ment algorithms. The resulting complements of 
exons are organized in a way currently unique to 
ExoLocator: multiple sequence alignments, both 
on the nucleotide and on the peptide levels, clearly 
indicating the exon boundaries. The alignments 
can be inspected in the web-embedded viewer, 
downloaded or used on the spot to produce an 
estimate of conservation within orthologous sets, 
or functional divergence across paralogues. 

INTRODUCTION 

The whole-genome sequencing projects, completed (1) or 
still under way (2), are bringing the comparative analysis 
of genomic and protein sequences (3) to a whole new level 
of insight and reliability, giving new impetus to the field.
However, assembling an exhaustive set of recognizably 
related sequences, be it protein or nucleotide, such that 
they are complete, and their source clear, remains a pains- 
taking task. ExoLocator aims to alleviate the problem for 
the case for which it is currently feasible: protein coding 
sequences from the fully sequenced vertebrate genomes. 
Protein coding sequences tend to be easier to locate on
the genome than the rest of the functional material
therein, and the homologues from different species are
also easier to align faithfully once translated to the
amino acid alphabet. Working with the completed
genomes enables us to establish the cognate sequences in
various species that are the closest mutual homologues.
These can then be used as templates for the similarity 
search to locate the full complement of exons for each 
studied gene. Finally, to organize the database in searchable
chunks, we use the human genome as the organizing
point, a type of organization that commonly agrees
with the search criteria used in biomedically motivated
analysis.

The need for such a data compilation, available in a single
place, has been recognized in the community through earlier
servers of a related nature (4-7), though these are smaller
in scope than the effort presented here.

RESULTS AND THEIR PRESENTATION 

Information collected in ExoLocator 

ExoLocator takes the ENSEMBL (1) database as its primary
source of information. The data set is organized using
the human genome as the orientation map. All human genes
annotated in ENSEMBL as 'known' and 'protein
coding' are collected, and their identifiers are used as a
reference for the whole group of vertebrate genes annotated as
orthologues (one-to-one or one-to-many) by the
ENSEMBL annotation pipeline.

According to ENSEMBL, some exons do not seem to
have a counterpart in closely related orthologous genes,
and ExoLocator is in part an investigation into the
possibility that they were overlooked in the annotation process.
As an estimate of the amount of information lost, we take
all canonical exons from human protein coding genes, and
align them with the exons from the genes annotated
as orthologous in other species. In these alignments,
some 15% of expected orthologues of human exons
appear absent. In our pipeline, 85% of the regions
where we expect to find the exons (see 'Materials
and Methods' section) contain explicitly indicated
un-sequenced stretches of genomic sequence. Two to
three percent of the missing exons are still recoverable,
in full or in partial length, by homology search.

The exons found by the pipeline are added to the overall
collection, and organized in several different ways for 
display and downstream analysis. 

ExoLocator's web interface 

The database offers for download the set of protein coding
exons compiled from ENSEMBL, complemented with
a straightforward homology search. It also provides the
most complete reconstruction of full protein sequences we
can achieve with this approach.

ExoLocator's interface provides several ways to inspect
the data: as lists of exons corresponding to a gene in each
species, as an alignment of orthologous proteins with the
exon boundaries indicated, as an alignment of within-species
paralogues, or as an alignment of alternative
splices. The last option is available only for the cases
that have Consensus CDS annotation (8). The alignments
are available at the nucleotide and amino acid levels. The
visualization of the alignment is provided by the
browser-embedded JalView alignment viewer (9). The orthologue
alignment comes with a set of notes detailing the list of
the exons it contains: their position in the gene and their
source, which is ENSEMBL itself, the Havana annotation project
(10) or the similarity search using the closest detectable
homologue (see 'Materials and Methods' section).

The search in ExoLocator can be done by providing the 
ENSEMBL identifier, pasting in the sequence on the 
protein level or through a limited name resolution search. 

MATERIALS AND METHODS 

The original exon set available at ENSEMBL, release e73, 
our main source of raw genomic data, was assembled 
through a combination of de novo gene detection and 
heuristic search (BLAST) by similarity. To that arsenal
of methods we have added a pairwise alignment (or
similarity search) algorithm by Edgar (11), and our in-house
implementation of a hardware-accelerated version of
Smith-Waterman search (12).

In addition to being an extensive exercise in mining the
ENSEMBL core data, the database also provides an insight
into the extent to which the number of known exons can
be extended by optimal sequence alignment (applied
to detection of homologous sequences across species).
To establish the search pipeline, implemented in Python
2.6, we had to make several decisions, and develop
appropriate software.

Pipeline description 

The first decision we make is to select a canonical set of 
exons for each human gene in ENSEMBL to use as the 
reference points in our search. Where exons overlap or 
disagree, exons annotated as 'known' with the greatest 
length and coverage are chosen over the others. This we
do by modeling exons as nodes in a directed acyclic graph,
with edges going from overlapping exons of greater
quality to those of lesser quality, and then
taking the set of nodes with no incoming edges as our
model set of exons for the gene. Quality is measured by
the strength of the annotation (Havana over ENSEMBL;
strongly supported splice signal over none), the length
and the similarity to a known template in existing species.
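
The selection of nodes with no incoming edges can be sketched as follows. This is a minimal illustration, not the ExoLocator code: the `Exon` record, its field names and the `quality` key are hypothetical, and the template-similarity component of the quality measure is left out for brevity. Edges are drawn only between overlapping exons of strictly different quality, which keeps the graph acyclic; overlapping exons of equal quality are both kept.

```python
from collections import namedtuple

# Hypothetical exon record; the field names are illustrative only.
Exon = namedtuple("Exon", "exon_id start end is_havana has_splice_signal")


def overlaps(a, b):
    """True if two exons share at least one position (inclusive coordinates)."""
    return a.start <= b.end and b.start <= a.end


def quality(exon):
    """Assumed ranking key: annotation source, then splice signal, then length."""
    return (exon.is_havana, exon.has_splice_signal, exon.end - exon.start + 1)


def canonical_exons(exons):
    """Return the exons that have no incoming edge in the quality graph.

    An edge goes from `better` to `worse` when the two overlap and `better`
    has strictly higher quality, so every edge points "downhill".
    """
    has_incoming_edge = set()
    for better in exons:
        for worse in exons:
            if (better is not worse
                    and overlaps(better, worse)
                    and quality(better) > quality(worse)):
                has_incoming_edge.add(worse.exon_id)
    return [e for e in exons if e.exon_id not in has_incoming_edge]
```

With three toy exons, of which the first two overlap and only the first carries a Havana annotation, the function keeps the first and the third.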

Then, to each human exon we attach a map to the 'master'
exons of the corresponding genes in the
other species. The maps are further reconciled in a
full-length protein alignment, to detect and accommodate the
cases of different intron positioning across species. An ad
hoc pairwise aligner that respects exon boundaries is used
for the purpose. For the final alignment on the multiple
sequence level we use the MAFFT alignment utility (13).

Search for missing exons 

To detect a missing exon we align a target vertebrate set
of exons corresponding to a single gene to the most
convincingly homologous set in human. To relate exons to
their parent gene we again rely on the annotation
provided by ENSEMBL. The alignment provides
the boundaries on the target gene for the search for
the missing exon. Next, we need to choose a template
from a species that has the exon annotated and is, in
some sense, the nearest to the species in which the exon
annotation is missing. For that purpose we use the taxonomy tree
available at the NCBI's Taxonomy Web site (14). We
traverse the tree to look for the taxonomically closest
species that has an exon mapping to the human at the
expected place, and use it as a template for the sequence
similarity search.
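
One way to picture the template choice is sketched below. It assumes the taxonomy is available as a child-to-parent map and defines "closest" as the species whose lowest common ancestor with the target lies the fewest steps above the target; both the data layout and the function names are illustrative assumptions rather than the pipeline's actual code.

```python
def lineage(species, parent):
    """Return the path from `species` up to the root of the tree."""
    path = [species]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def closest_template_species(target, candidates, parent):
    """Pick the candidate sharing the most recent common ancestor with `target`.

    `candidates` are the species that have the exon annotated;
    `parent` maps each node of the taxonomy tree to its parent node.
    Returns None when no candidate shares an ancestor with the target.
    """
    steps_above_target = {
        node: steps for steps, node in enumerate(lineage(target, parent))
    }
    best, best_steps = None, None
    for species in candidates:
        if species == target:
            continue
        for node in lineage(species, parent):
            if node in steps_above_target:
                steps = steps_above_target[node]
                break
        else:
            continue  # no common ancestor found in this tree
        if best_steps is None or steps < best_steps:
            best, best_steps = species, steps
    return best


# Toy, heavily simplified tree (intermediate ranks omitted).
TOY_PARENT = {
    "mouse": "Murinae",
    "rat": "Murinae",
    "Murinae": "Glires",
    "rabbit": "Glires",
    "Glires": "Euarchontoglires",
    "human": "Euarchontoglires",
}
```

In this toy tree, a mouse exon missing from the annotation would be searched for with the rat exon as template if one exists, then with rabbit, and only then with human.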

Finally, to detect the region of homology, we use an
advanced CPU implementation of a heuristic search,
and a GPU-enabled Smith-Waterman algorithm. The
implementations of the latter available in the public domain
are not capable of handling the sizes of the input
sequences we have at hand, and therefore we use our own
implementation, in which the problem is divided into
smaller chunks in a way amenable to graphics card
acceleration (https://github.com/mkorpar/swSharp). The exons
that ExoLocator reports satisfy two criteria: their translation
must be longer than three residues, and their similarity by a
Tanimoto-like similarity measure, $S_{12}/\sqrt{L_1 L_2}$, where $L_1$
and $L_2$ are the lengths of the template and the candidate
exon, and $S_{12}$ is the similarity-weighted length of the
common aligned positions, must be larger than 1/3.
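
The two reporting criteria can be written down as a short filter. The sketch assumes that the similarity measure has the form S12 / sqrt(L1 * L2); that form is an assumption made here for illustration, and the function and parameter names are hypothetical.

```python
import math


def exon_similarity(s12, template_len, candidate_len):
    """Assumed form of the measure: S12 / sqrt(L1 * L2).

    `s12` is the similarity-weighted length of the aligned positions,
    `template_len` and `candidate_len` are the two exon lengths.
    """
    return s12 / math.sqrt(template_len * candidate_len)


def passes_filters(translation, s12, template_len, candidate_len,
                   min_residues=3, min_similarity=1.0 / 3.0):
    """Apply both criteria: translation longer than three residues,
    and similarity strictly larger than 1/3."""
    if len(translation) <= min_residues:
        return False
    return exon_similarity(s12, template_len, candidate_len) > min_similarity
```

Under this assumed form the measure equals 1 for two exons of equal length that align perfectly, and it decreases both with poorer alignment and with a growing length mismatch.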

Known problems and caveats 

Some otherwise interesting peculiarities of animal
genomes complicate systematic analysis of the sort we
undertook here. Thus we do not attempt to resolve the
cases of overlapping genes on the same strand, of which
we detected several hundred cases (some of these, though,
might be duplicate entries in the source database we
are using). As we rely on the ENSEMBL pipeline for
the annotation of at least approximate exon location,
when the whole gene is unannotated in the species, it
will be missing in our search too. Also, the detection
by similarity that we use does not allow us to decide on the
precise location of the gene boundary. We use
MaxEntScan (15), a lightweight implementation of a
strong statistical tool, to try to estimate the likelihood
that the predicted exon has a proper splice signal.
MaxEntScan in its currently available parametrization
works best for mammalian sequences of introns
being spliced out by the major spliceosome. In these
cases we use MaxEntScan to decide on the boundary
and the possible phase of the exon, and provide the
score MaxEntScan assigns in the accompanying notes.

Database implementation 

The database that is accessible via the Internet is a relatively
straightforward MySQL application with a small number
of tables storing the information about the exon coordinates,
sequences and the source of the annotation. The
processing pipeline used to fill the database, however, is a
much more complex Python implementation sourcing
the data from the local versions of the core ENSEMBL
databases. The Web interface for the database was
implemented in Play! Framework (http://www.playframework.org).
With the full cycle of data processing being rather
time-consuming, the database will be on a semiannual
update schedule.
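
To make the kind of table described above concrete, the sketch below defines a single exon table holding coordinates, sequence and annotation source, and a lookup by gene and species. The table and column names are entirely hypothetical, and SQLite (from the Python standard library) stands in for MySQL so that the example runs without a server; it is not the actual ExoLocator schema.

```python
import sqlite3

# Hypothetical layout; the real database may be organized differently.
SCHEMA = """
CREATE TABLE exon (
    exon_id    INTEGER PRIMARY KEY,
    gene_id    TEXT    NOT NULL,  -- identifier of the parent gene
    species    TEXT    NOT NULL,
    start_pos  INTEGER NOT NULL,  -- coordinates on the genomic sequence
    end_pos    INTEGER NOT NULL,
    strand     INTEGER NOT NULL,
    sequence   TEXT    NOT NULL,
    source     TEXT    NOT NULL   -- e.g. 'ensembl', 'havana', 'similarity_search'
);
"""


def open_database(path=":memory:"):
    """Create the (hypothetical) schema in a fresh SQLite database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def exons_for_gene(conn, gene_id, species):
    """Return (start, end, source) for one gene in one species, in genomic order."""
    cursor = conn.execute(
        "SELECT start_pos, end_pos, source FROM exon "
        "WHERE gene_id = ? AND species = ? ORDER BY start_pos",
        (gene_id, species),
    )
    return cursor.fetchall()
```

Keeping the annotation source as a column is what would allow exons recovered by the similarity search to be told apart from those taken directly from the source annotation.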



CONCLUSION AND OUTLOOK 

At its current stage, ExoLocator aims to balance the goal 
of giving the complete picture of possible coding exons, 
with the need to stay grounded in terms of the verifiability 
of the actual function of the sequences it collects. Thus it
relies on ENSEMBL's annotation of 'known' ('known'
here being the actual annotation term used, hence the
quotes) human exons as the anchor for the search and
for the results presentation. In certain cases, it seems
that the data from species other than human argue for a
different annotation, but we deliberately choose to stay
away from any reinterpretation of the existing data.
Rather, we envision the database to function as a
shortcut to quick retrieval of the established sequences
of well-documented human exons and their closest
counterparts in the other species, complemented by the
putative exon set collected by a reliable search utility.
The hope is that, rather than as an interpretative tool, 
ExoLocator will be understood and used as a resource, 
ultimately leading to fuller understanding of the complex 
mechanism of gene function, alternative splicing and 
translation. 